WebDoubler - User's Guide

	Version 1.1 User's Guide

Pre-Cache: Preloading Remote Web Sites	Previous \| Next Contents

WebDoubler offers a "Pre-Cache" function, which allows the WebDoubler administrator to select certain Web sites for pre-loading at a scheduled time. For example, a teacher can tell WebDoubler that a certain Web site will be accessed by students in a computer lab the next day. During the night, when no other activity is taking place on the school's Internet connection, WebDoubler will "crawl" the entire site, caching it's complete contents. The next day, students will have access to the remote site at full LAN speeds, even if the Internet connection fails.

Pre-cached sites can also be periodically re-crawled so that they stay current and are available immediately from cache indefinitely. This allows the WebDoubler administrator to improve service to remote Web sites that are popular with local users. For example, a large organization with multiple Web servers can set these servers to be periodically crawled. Users at each location will have LAN speed access to remote servers, and the content will remain fresh.

Getting Started

To use the Pre-Cache plug-in, open a Web browser and access the WebDoubler Administration home page as described in the "Plug-In Administration" section of the WebDoubler User's Guide. Click the "WebDoubler Pre-Cache Administration" link to open the Pre-Cache Administration page.

When you first access this page, no sites will be available in the "Pre-Cached Web Sites" list. Click the "New Pre-Cache Site..." link to open the detail page and specify a site to be crawled and cached.

Figure 48: The Pre-Cache Editor Page

Basic Setup

The first section of the pre-cache detail page determines general setup information. Give the pre-cache site a descriptive name in the "Site Name" field. For example, if you were going to crawl the Maxum Web site, you might choose a name of "Maxum Development". Next, enter the full URL of the Web site's home page.

The rest of the items in the "Setup" section serve to keep the crawler under control. First, the "Delay Between Hits" causes WebDoubler to wait between page accesses to avoid overloading the remote server (and possibly your own Internet connection). If the server you are crawling is your own, you may set this value to anything you like, including "0" for no delay at all. If the server WebDoubler will be crawling is run by someone else, you should be certain to avoid overloading the server and annoying the Webmaster.

You may also limit the number of items (Web pages, graphics, and other files) and the total amount of data that will be downloaded during the crawl. This is important to avoid overloading your own hard drive and keeps WebDoubler from crawling out of control should a problem occur. (For example, mistakes in configuration, unknown problems in HTML files, and certain CGI processes can all lead to out of control crawlers.)

Select maximums for the total "Number Of Items" and the "Size Of Site". If you know that the site being crawled is large, and you are certain that you want the entire site loaded, increase the default values for these fields accordingly. If you are certain the site is fairly small, you may want to lower the defaults to reduce the chance of the crawl going out of control.

WebDoubler will crawl items in the order it encounters them, so even when the crawl reaches a maximum and fails to load a site completely, it will normally download the "best" items. Site menus and other "high-level" information will generally appear near "the top" of a Web site, so it will be crawled by WebDoubler first and therefore have a higher chance of being included in the cache.

Scheduling The Crawler

The "Scheduling The Crawler" configuration settings allow you to specify when WebDoubler will perform the crawl, and how often the crawl will be repeated thereafter. Simply enter a date and time for the "Next Crawl" to schedule when WebDoubler should begin crawling. Choose from the available options in the "Re-Crawl" pop-up to choose how often WebDoubler should refresh the cache.

If you are crawling a site for use just one time, leave the "Next Crawl" date at the current date and set the time for "11:00 PM" or some other time when network usage is low. If you would like the crawl to begin as soon as you have completed defining it, then leave the date and time at the default (the current date and time). Finally, choose "Never" for the "Re-Crawl" time period.

Crawler Control

Keeping the crawler on the site you are trying to cache is absolutely crucial, and is done by specifying the items in the "Crawler Control" section. This section begins with the "Allowed Crawl Paths" which specifies where the crawler is allowed to go. The path defined by the "Home Page URL", described above, will automatically be added to the list of allowed paths. Other paths can be added to allow the crawler to load additional folders or servers.

For example, the Maxum Development Web site starts at "http://www.maxum.com/", but much of the site is served from other servers at "http://examples.maxum.com/" and "http://search.maxum.com/". By default, the Web crawler will not crawl to Web servers other than the one specified in the Home Page URL, so pages on "examples.maxum.com" and "search.maxum.com" will be ignored. To have these pages included, you must specifically allow WebDoubler to crawl these servers by adding "http://examples.maxum.com/" and "http://search.maxum.com" to the list of "Allowed Crawl Paths".

Once you have specified all the servers and folders WebDoubler is allowed to go, additional restrictions may be placed to cut out unnecessary crawling. For example, your own organization's Web server may maintain a series of log files of past access. These log files may be very large and completely irrelevant for caching purposes. In this case, the entire folder of log files can be skipped by the crawler by adding something like "http://www.mycompany.com/logs/" to the "Disallowed Crawl Paths" list.

Similarly, you can tell WebDoubler to ignore files based on their filename extension. (The filename extension is the last few characters of a filename, beginning with a period, that tells Web servers and clients what kind of file is being served.) For example, to avoid caching large download files, you could add ".zip", ".exe", ".sit" and ".hqx" to the "Ignored Filename Extensions" list.

Both the "Disallowed Crawl Paths" and "Ignored Filename Extensions" lists restrict WebDoubler downloads completely, regardless of the paths specified in the "Allowed Crawl Paths" list. For a file to be pre-cached by WebDoubler, it must be located in an "Allowed" path, but if the file is also in a "Disallowed" path or has an "Ignored" extension, it will not be loaded into the cache.

Completing The Crawl

Once the crawler has been defined, click the "Save Changes" button. The crawler will run at the specified time, and once it has, pre-cached files will automatically be served from the cache.

It is important to note that the pre-cache functions separately from the standard WebDoubler cache. For pages, graphics, etc. loaded into the pre-cache, the normal rules about caching and avoiding stale content are ignored. If a file is in WebDoubler's pre-cache, it will be served from cache regardless of it'\s age. Items in the pre-cache will never be removed until the site is automatically recrawled or the WebDoubler administrator deletes the pre-cached Web site from WebDoubler. If it is important that the content of a particular site is always up to date, then you should not use the pre-cache to load the site.

When you no longer need pre-cache access to a site WebDoubler has crawled, be sure to delete it. Entire Web sites can use a lot of disk space on your WebDoubler server, and it is important to free up this space. To delete the pre-cache stored site, use the "Delete Site" button at the bottom of the Pre-Cache Editor page.

Crawler Operation Details

WebDoubler is capable of pre-caching multiple Web sites simultaneously. If you define more than one pre-cache site, and their schedules overlap, up to 4 sites may be crawled at one time.

Also, be aware that running a Web crawler is not a trivial process and you should keep a few simple rules in mind. First, be very careful not to overload remote Web sites. Modern Web servers are very fast and it is unlikely that WebDoubler will have a serious impact on the performance of a remote Web site. However, be sure to allow at least a few seconds delay in between hits (configured in the "Basic Setup" section) while WebDoubler is crawling.

You should also always respect the wishes of the Webmasters who run the sites you are crawling. If you are asked not to crawl a site, don't pre-cache it. You should also be careful not to crawl sites repeatedly. If you are just testing out WebDoubler's pre-cache capability, for example, use your own Web server for crawling, or one run by someone you know so you can ask permission first.

WebDoubler does follow the "Robot Exclusion Standard", which means that when it first contacts a remote site, it will request a standard file called "Robots.txt". The crawler will then use the information in the file to determine what portions of the Web site have been deemed accessible to crawlers by the Web site administrator. Most Webmasters will use the "Robots.txt" file to help WebDoubler stay directed toward the content on the site that is generally available for user Web access, avoiding portions of the site that are not suitable for crawling. Some Webmasters, however, may use this file to deny access to their Web site to crawlers completely. WebDoubler will respect this decision, and will not crawl sites that use the exclusion.


Copyright © 1999 Maxum Development Corporation http://www.maxum.com/	Previous \| Next Contents